wget
Section: User Commands (1)
Updated: 1996 Nov 11
NAME
wget - a utility to retrieve files from the World Wide Web
SYNOPSIS
wget [options] [URL-list]
WARNING
The information in this man page is an extract from the full
documentation of
Wget.
It could very well be out of date. Please refer to the info page for
full, up-to-date documentation. You can view the info documentation
with the Emacs info subsystem or the standalone info program.
DESCRIPTION
Wget
is a utility designed for retrieving binary documents across the Web,
through the use of HTTP (Hypertext Transfer Protocol) and
FTP (File Transfer Protocol), and saving them to disk.
Wget
is non-interactive, which means it can work in the background while
the user is not logged in, unlike most web browsers (thus you may
start the program and log off, letting it do its work). By analysing
server responses, it distinguishes between correctly and incorrectly
retrieved documents, and retries retrieving them as many times as
necessary, or until a user-specified limit is reached. The REST
command is used for restarting transfers on FTP hosts that support
it. Proxy servers are supported to speed up the retrieval and lighten
the network load.
Wget
supports a full-featured recursion mechanism, through which you can
retrieve large parts of the web, creating local copies of remote
directory hierarchies. Of course, maximum level of recursion and other
parameters can be specified. Infinite recursion loops are always
avoided by hashing the retrieved data. All of this works for both
HTTP and FTP.
The retrieval is conveniently traced by printing dots, each dot
representing one kilobyte of received data. Built-in features offer
mechanisms to tune which links you wish to follow (cf. -L, -D and -H).
URL CONVENTIONS
Most of the URL conventions described in RFC1738 are supported. Two
alternative syntaxes are also supported, which means you can use three
forms of address to specify a file:
Normal URL (recommended form):
http://host[:port]/path
http://fly.cc.fer.hr/
ftp://ftp.xemacs.org/pub/xemacs/xemacs-19.14.tar.gz
ftp://username:password@host/dir/file
FTP only (ncftp-like):
hostname:/dir/file
HTTP only (netscape-like):
hostname[:port]/dir/file
You may encode your username and/or password into the URL using the form:
ftp://user:password@host/dir/file
If you do not understand these syntaxes, just use the plain ordinary
syntax with which you would call lynx or netscape. Note
that the alternative forms are deprecated, and may cease being
supported in the future.
OPTIONS
There are quite a few command-line options for
wget.
Note that you do not have to know or to use them unless you wish to
change the default behaviour of the program. For simple operations you
need no options at all. It is also a good idea to put frequently used
command-line options in .wgetrc, where they can be stored in a more
readable form.
This is the complete list of options with descriptions, sorted in
descending order of importance:
-h --help
Print a help screen. You will also get help if you do not supply
command-line arguments.
-V --version
Display the version of wget.
-v --verbose
Verbose output, with all the available data. The default output
consists only of saving updates and error messages. When the output
is stdout, verbose is the default.
-q --quiet
Quiet mode, with no output at all.
-d --debug
Debug output; it works only if wget was compiled with -DDEBUG. Note
that even when the program is compiled with debug output, it is not
printed unless you specify -d.
-i filename --input-file=filename
Read URL-s from filename, in which case no URL-s need to appear on
the command line. If there are URL-s both on the command line and in
an input file, those on the command line will be retrieved first. The
file need not be an HTML document (but no harm if it is) - it is
enough if the URL-s are just listed sequentially.
However, if you specify --force-html, the document will be regarded as
HTML. In that case you may have problems with relative links,
which you can solve either by adding <base href="url"> to the document
or by specifying --base=url on the command-line.
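For example, assuming a file named url-list.txt that contains one URL
per line (the filename is only illustrative):
wget -i url-list.txt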
-o logfile --output-file=logfile
Log all messages to logfile, instead of the default stdout. Verbose
output is the default when logging to a file. If you do not wish it,
use -nv (non-verbose).
-a logfile --append-output=logfile
Append to logfile - the same as -o, but appending to logfile (or
creating a new one if it does not exist) instead of overwriting the
old log file.
-t num --tries=num
Set number of retries to
num.
Specify 0 for infinite retrying.
-f --follow-ftp
Follow FTP links from HTML documents.
-c --continue-ftp
Continue retrieval of FTP documents from where it was left off. If
you specify "wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z", and there
is already a file named ls-lR.Z in the current directory, wget will
continue retrieval from the offset equal to the length of the
existing file. Note that you do not need to specify this option if
you only want wget to continue retrieving where it left off when the
connection is lost - wget does this by default. You need this option
only to continue retrieving a file that is already partially
retrieved, whether saved by other FTP software or left behind by a
killed wget.
-g on/off --glob=on/off
Turn FTP globbing on or off. By default, globbing will be turned on
if the URL contains globbing characters (e.g. an asterisk). Globbing
means you may use the special characters (wildcards) to retrieve more
files from the same directory at once, like "wget
ftp://gnjilux.cc.fer.hr/*.msg". Globbing currently works only with
UNIX FTP servers.
-e command --execute=command
Execute command, as if it were a part of .wgetrc file. A
command invoked this way will take precedence over the same command
in .wgetrc, if there is one.
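For example, to turn off robots.txt processing for a single run
(robots being one of the startup file commands listed below):
wget -e "robots = off" http://fly.cc.fer.hr/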
-N --timestamping
Use so-called time-stamps to determine whether to retrieve a file. If
the last-modification date of the remote file is equal to or older
than that of the local file, and the sizes of the two files are
equal, the remote file will not be retrieved. This option is useful
for weekly mirroring of HTTP or FTP sites, since it will not permit
downloading of the same file twice.
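For example, to re-retrieve a file only if the remote copy is newer
than the local one:
wget -N ftp://ftp.xemacs.org/pub/xemacs/xemacs-19.14.tar.gz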
-F --force-html
When input is read from a file, force it to be HTML. This
enables you to retrieve relative links from existing HTML files
on your local disk, by adding <base href> to HTML, or using
--base.
-B base_href --base=base_href
Use base_href as base reference, as if it were in the file, in
the form <base href="base_href">. Note that the base in the file will
take precedence over the one on the command-line.
-r --recursive
Recursive web-suck. Depending on the protocol of the URL, this can
mean two things. Recursive retrieval of an HTTP URL means that Wget
will download the URL you specify, parse it as an HTML document (if
it is one), and retrieve the files this document refers to, down to a
certain depth (default 5; change it with -l).
Wget
will create a hierarchy of directories locally, corresponding to the
one found on the HTTP server.
This option is ideal for presentations, where slow connections should
be bypassed. The results will be especially good if relative links
were used, since the pages will then work on the new location without
change.
When using this option with an FTP URL, it will retrieve all the
data from the given directory and subdirectories, similar to
HTTP recursive retrieval.
You should be warned that invoking this option may cause grave
overloading of your connection, and your system administrator may
choose not to enable it. The load can be minimized by lowering the
maximal recursion level (see -l) and/or by lowering the number of
retries (see -t).
-m --mirror
Turn on mirroring options. This will set recursion and time-stamping,
combining -r and -N.
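For example, the following is equivalent to specifying -r -N:
wget -m http://fly.cc.fer.hr/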
-l depth --level=depth
Set the recursion depth to the specified level. The default is 5.
After the given recursion level is reached, the sucking will proceed
from the parent. Thus specifying -r -l1 should be equivalent to a
recursion-less retrieval of a single document. Setting the level to
zero makes the recursion depth (theoretically) unlimited. Note that
the number of retrieved documents will increase exponentially with
the depth level.
-H --span-hosts
Enable spanning across hosts when doing recursive retrieving. See
-r and -D. Refer to
FOLLOWING LINKS
for a more detailed description.
-L --relative
Follow only relative links. Useful for retrieving a specific homepage
without any distractions, not even those from the same host. Refer to
FOLLOWING LINKS
for a more detailed description.
-D domain-list --domains=domain-list
Set domains to be accepted and DNS looked-up, where domain-list is a
comma-separated list. Note that it does not turn on -H. This speeds
things up, even if only one host is spanned. Refer to
FOLLOWING LINKS
for a more detailed description.
-A acclist / -R rejlist --accept=acclist / --reject=rejlist
Comma-separated list of extensions to accept/reject. For example, if
you wish to download only GIFs and JPEGs, you will use -A gif,jpg,jpeg.
If you wish to download everything except cumbersome MPEGs and .AU
files, you will use -R mpg,mpeg,au.
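For example, to recursively retrieve only GIF and JPEG images from a
site:
wget -r -A gif,jpg,jpeg http://fly.cc.fer.hr/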
-X list --exclude-directories=list
Comma-separated list of directories to exclude from FTP fetching.
-P prefix --directory-prefix=prefix
Set directory prefix ("." by default) to
prefix. The directory prefix is the directory where all other
files and subdirectories will be saved to.
-p --prefix-files
Set prefixed files. By default, Wget saves each URL to an
appropriately named file (e.g. http://yoyodine.com/sharon.gif will be
written to "sharon.gif"); if a file with that name already exists, it
tries "filename.1", then "filename.2", etc. This option turns that
behaviour off, saving all your files as "received.n", where n is a
number, 1 or greater. This is sometimes handy for managing a large
number of files that you can easily reconstruct. Set file_prefix to
change the "received" prefix, and -P to change the directory.
-T value --timeout=value
Set the read timeout to a specified value. Whenever a read is issued,
the file descriptor is checked for a possible timeout, which could
otherwise leave a pending connection (uninterrupted read). The default
timeout is 900 seconds (fifteen minutes).
-Y on/off --proxy=on/off
Turn the use of proxies on or off. The proxy is on by default if the
appropriate environment variable is defined.
-Q quota[KM] --quota=quota[KM]
Specify the download quota, in bytes (default), kilobytes or
megabytes. This option is more useful in the rc file; see STARTUP
FILE below.
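For example, to stop retrieving after roughly two megabytes have been
downloaded (the input file name is only illustrative):
wget -Q2M -i url-list.txt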
-O filename --output-document=filename
The documents will not be written to the appropriate files, but will
all be appended to a single file with the name specified by this
option. The number of tries will be automatically set to 1. If this
filename is `-', the documents will be written to stdout, and --quiet
will be turned on. Use this option with caution, since it turns off
all the diagnostics Wget can otherwise give about various errors.
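For example, to write a document to stdout and pipe it through a
pager:
wget -O - http://fly.cc.fer.hr/ | less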
-S --server-response
Print the headers sent by the HTTP server and/or responses sent
by the FTP server.
-s --save-headers
Save the headers sent by the HTTP server to the file, before the
actual contents.
--header=additional-header
Define an additional header. You can define more than one additional
header. Do not try to terminate the header with CR or LF.
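For example, to send an extra header with each HTTP request (the
header shown is only illustrative):
wget --header="Accept-Charset: iso-8859-2" http://fly.cc.fer.hr/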
--http-user --http-passwd
Use these two options to set the username and password Wget will send
to HTTP servers. Wget supports only the basic WWW authentication
scheme.
-nc
Do not clobber existing files when saving to a directory hierarchy
within recursive retrieval of several files. This option is extremely
useful when you wish to continue a retrieval where you left off. If
the files are .html or (yuck) .htm, they will be loaded from disk and
parsed as if they had been retrieved from the Web.
-nv
Non-verbose - turn off verbose without being completely quiet (use
-q for that), which means that error messages and basic information
still get printed.
-nd
Do not create a hierarchy of directories when retrieving
recursively. With this option turned on, all files will get
saved to the current directory, without clobbering (if
a name shows up more than once, the filenames will get
extensions .n).
-x
The opposite of -nd -- Force creation of a hierarchy of directories
even if it would not have been done otherwise.
-nh
Disable time-consuming DNS lookup of almost all hosts. Refer to
FOLLOWING LINKS
for a more detailed description.
-nH
Disable host-prefixed directories. By default, http://fly.cc.fer.hr/
will produce a directory named fly.cc.fer.hr in which everything else
will go. This option disables such behaviour.
--no-parent
Do not ascend to the parent directory.
-k --convert-links
Convert the non-relative links to relative ones locally.
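For example, to retrieve a page tree and convert its links for local
browsing:
wget -r -k http://fly.cc.fer.hr/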
FOLLOWING LINKS
Recursive retrieving has a mechanism that allows you to specify which
links
wget
will follow.
Only relative links
When only relative links are followed (option -L), recursive
retrieving will never span hosts; gethostbyname will never get
called, and the process will be very fast, with the minimum strain on
the network. This will suit your needs most of the time, especially
when mirroring the output of *2html converters, which generally
produce only relative links.
Host checking
The drawback of following only relative links is that humans often
mix them with absolute links to the very same host, and the very same
page. In this mode (which is the default), all URL-s that refer to
the same host will be retrieved.
The problem with this option is host and domain aliases. There is no
way for wget to know that regoc.srce.hr and www.srce.hr are the same
host, or that fly.cc.fer.hr is the same as fly.cc.etf.hr. Whenever an
absolute link is encountered, gethostbyname is called to check
whether we are really on the same host. Although the results of
gethostbyname are hashed, so that it will never get called twice for
the same host, it still presents a nuisance, e.g. in large indexes of
different hosts, where each of them has to be looked up. You can use
-nh to prevent such complex checking, in which case wget will just
compare the hostnames. Things will run much faster, but also much
less reliably.
Domain acceptance
With the -D option you may specify the domains that will be followed.
The nice thing about this option is that hosts not in those domains
will not get DNS-looked up. Thus you may specify -Dmit.edu just to
make sure that nothing outside .mit.edu gets looked up. This is very
important and useful. It also means that -D does not imply -H (which
must be explicitly specified). Feel free to use this option, since it
will speed things up greatly, with almost all the reliability of
checking all hosts. Domain acceptance can also be used to limit the
retrieval to particular domains while spanning hosts freely within
those domains, but then you must explicitly specify -H, as in the
example below.
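For example, to span hosts during a recursive retrieval, but only
within the .mit.edu domain (the starting URL is only illustrative):
wget -r -H -Dmit.edu http://web.mit.edu/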
All hosts
When -H is specified without -D, all hosts are spanned. It is wise to
set the recursion level to a small value in such cases. Unrestricted
spanning of this kind is rarely useful.
FTP
The rules for FTP are somewhat specific, since they have to be. To
have FTP links followed from HTML documents, you must specify -f
(follow_ftp). If you do specify it, FTP links will be able to span
hosts even if span_hosts is not set. The relative_only option (-L)
has no effect on FTP. However, domain acceptance (-D) and suffix
rules (-A/-R) still apply.
STARTUP FILE
Wget
supports the use of the initialization file .wgetrc. First a
system-wide init file will be looked for (/usr/local/lib/wgetrc by
default) and loaded. Then the user's file will be searched for in two
places: the file named by the environment variable WGETRC (which is
presumed to hold the full pathname) and $HOME/.wgetrc. Note that the
settings in the user's startup file may override the system settings,
which includes the quota settings (he he).
The syntax of each line of the startup file is simple:
variable = value
Valid values differ from variable to variable. The complete set of
commands is listed below, with the notation after the equals sign
denoting the kind of value the command takes: on/off for on or off
(which can also be 1 or 0), string for any string, and N for a
positive integer. For example, you may specify "use_proxy = off" to
disable the use of proxy servers by default. You may use inf for an
infinite value (the role of 0 on the command line), where
appropriate. The commands are case-insensitive and
underscore-insensitive, thus File__Prefix is the same as fileprefix.
Empty lines, lines consisting of spaces, and lines beginning with '#'
are skipped.
Most of the commands have their equivalent command-line option,
except some more obscure or rarely used ones. A sample init file is
provided in the distribution, named sample.wgetrc.
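For illustration, a minimal .wgetrc might look like this (the values
are only examples):
# Disable proxies and stop after 5 megabytes.
use_proxy = off
quota = 5m
# Retry each URL up to 10 times, with a two-minute read timeout.
num_tries = 10
timeout = 120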
accept/reject = string
Same as -A/-R.
add_hostdir = on/off
Enable/disable host-prefixed directories. -nH disables it.
always_rest = on/off
Enable/disable continuation of the retrieval, the same as -c.
base = string
Set base for relative URL-s, the same as -B.
convert_links = on/off
Convert non-relative links locally. The same as -k.
debug = on/off
Debug mode, same as -d.
dir_mode = N
Set permission modes of created subdirectories (default is 755).
dir_prefix = string
Top of directory tree, the same as -P.
dirstruct = on/off
Turn the creation of directory structure on or off, the same as -x or
-nd, respectively.
domains = string
Same as -D.
file_prefix = string
Set prefix for output files. It works only if -p is set.
Wget
saves all the files retrieved from the net in files received.n, where
n is a number. If a file named received.n exists,
wget
tries with n + 1, and so forth. The file_prefix option changes the
default "received" prefix.
follow_ftp = on/off
Follow FTP links from HTML documents, the same as -f.
force_html = on/off
If set to on, force the input filename to be regarded as an HTML
document, the same as -F.
ftp_proxy = string
Use the string as the FTP proxy, instead of the one specified in the
environment.
glob = on/off
Turn globbing on/off, the same as -g.
header = string
Define an additional header, like --header.
http_passwd = string
Set HTTP password.
http_proxy = string
Use the string as the HTTP proxy, instead of the one specified in the
environment.
http_user = string
Set HTTP user.
input = string
Read the URL-s from the given file, like -i.
kill_longer = on/off
Consider data longer than specified in the content-length header as
invalid (and retry getting it). The default behaviour is to save as
much data as there is, provided the amount is greater than or equal
to the content-length value.
logfile = string
Set logfile, the same as -o.
login = string
Your user name on the remote machine, for FTP. Defaults to
"anonymous".
mirror = on/off
Turn mirroring on/off. The same as -m.
noclobber = on/off
Same as -nc.
no_parent = on/off
Same as --no-parent.
no_proxy = string
Use the string as the comma-separated list of domains to avoid when
using proxies, instead of the one specified in the environment.
num_tries = N
Set number of retries per URL, the same as -t.
output_document = string
Set the output filename, the same as -O.
passwd = string
Your password on the remote machine, for FTP. Defaults to
username@hostname.domainname.
prefix_files = on/off
Set prefixed files, the same as -p.
quiet = on/off
Quiet mode, the same as -q.
quota = quota
Specify the download quota, which is useful to put in
/usr/local/lib/wgetrc. When a download quota is specified, wget will
stop retrieving after the download total exceeds the quota. The quota
can be specified in bytes (default), kbytes ('k' appended) or mbytes
('m' appended). Thus "quota = 5m" will set the quota to 5 mbytes.
Note that the user's startup file overrides system settings.
reclevel = N
Recursion level, the same as -l.
recursive = on/off
Recursive retrieval on/off, the same as -r.
relative_only = on/off
Follow only relative links (the same as -L). Refer to section
FOLLOWING LINKS
for a more detailed description.
robots = on/off
Use (or ignore) the robots.txt file.
server_response = on/off
Choose whether or not to print the HTTP and FTP server
responses, the same as -S.
simple_host_check = on/off
Same as -nh.
span_hosts = on/off
Same as -H.
timeout = N
Set timeout value, the same as -T.
timestamping = on/off
Turn timestamping on/off. The same as -N.
use_proxy = on/off
Turn proxy support on/off. The same as -Y.
verbose = on/off
Turn verbose on/off, the same as -v/-nv.
SIGNALS
Wget
will catch the SIGHUP (hangup signal) and ignore it. If the output
was going to stdout, it will be redirected to a file named wget-log.
This is also convenient when you wish to redirect the output of a
running Wget:
$ wget http://www.ifi.uio.no/~larsi/gnus.tar.gz &
$ kill -HUP %% # to redirect the output
Wget will not try to handle any signals other than
SIGHUP. Thus you may interrupt Wget using ^C or
SIGTERM.
EXAMPLES
Get URL http://fly.cc.fer.hr/:
wget http://fly.cc.fer.hr/
Force non-verbose output:
wget -nv http://fly.cc.fer.hr/
Remove the limit on the number of retries:
wget -t0 http://www.yahoo.com/
Create a mirror image of fly's web (with the same directory structure
the original has), up to six recursion levels, with only one try per
document, saving the verbose output to log file 'log':
wget -r -l6 -t1 -o log http://fly.cc.fer.hr/
Retrieve only from the www.yahoo.com host (depth 50):
wget -r -l50 http://www.yahoo.com/
ENVIRONMENT
http_proxy,
ftp_proxy,
no_proxy,
WGETRC,
HOME
FILES
/usr/local/lib/wgetrc,
$HOME/.wgetrc
UNRESTRICTIONS
Wget
is free; anyone may redistribute copies of
Wget
to anyone under the terms stated in the General Public License, a copy
of which accompanies each copy of
Wget.
SEE ALSO
lynx(1),
ftp(1)
AUTHOR
Hrvoje Niksic <hniksic@srce.hr> is the author of Wget. Thanks to the
beta testers and all the other people who helped with useful
suggestions.